Abstract



By analogy with Monte Carlo algorithms, we propose new strategies for design and 
redesign of small molecule libraries in high-throughput experimentation, or combinatorial
chemistry. Several Monte Carlo methods are examined, including Metropolis,
three types of biased schemes, and composite moves that include swapping or parallel
tempering. Among them, the biased Monte Carlo schemes exhibit particularly high 
efficiency in locating optimal compounds. The Monte Carlo strategies are compared to 
a genetic algorithm approach. Although the best compounds identified by the genetic 
algorithm are comparable to those from the better Monte Carlo schemes, the diversity 
of favorable compounds identified is reduced by roughly 60%. 
1 Introduction

High-throughput synthesis is now established as one of the methods for the discovery of 
new drugs, materials, and catalysts. High-throughput, or combinatorial, methods allow for 
simultaneous creation of a large number of structurally diverse and complex compounds,
generalizing the traditional techniques of single compound synthesis. Parallel synthesis and
split/pool synthesis on solid phase, for example, are two commonly used methods for
combinatorial synthesis. After combinatorial synthesis of the desired number of compounds,
high-throughput screening is used to identify the few molecules optimally possessing the
property of interest. Among the high-throughput methodologies, small molecule combinatorial
chemistry is the most developed and has been applied successfully in areas such as
transition metal complexation, chemical genetic screening, catalysis, and drug discovery.

High-throughput chemistry can be viewed as a search over a multi-dimensional space 
of composition variables for molecules possessing a high degree of the desired function, or 
figure of merit. Indeed, the parallel synthesis and split/pool synthesis methods search the 
composition space in a regular, grid-like fashion. As the complexity of the molecular library 
grows, the number of dimensions in the composition variable space grows, and with a grid- 
like method, the number of compounds that must be synthesized to search the space grows 
exponentially. Synthesis and screening of mixtures of compounds can partially alleviate the 
dimensional curse. However, a mixture approach raises the question of how to deconvolute
and interpret the results. The greater the degree of mixing, the stronger the synergistic
effects can be in the mixture, and the more difficult it is to identify individual compounds
responsible for the activity.

The challenge of searching the composition space in an efficient way has led to extensive 
efforts in the rational design of combinatorial, or high-throughput, libraries. A basic assump- 
tion in library design is that structurally similar compounds tend to display similar activity 
profiles. By designing libraries with maximum structural diversity, the potential for finding 
active compounds in the high-throughput screenings can be enhanced. This design approach 
requires a quantitative account of the structural and functional diversity of the library, and 
many descriptors have been developed. Optimization of a library to maximum diversity is
then driven by a reliable statistical method. Several structurally diverse libraries have been
successfully designed along these lines. For example, strategies have been presented to
optimize the structural diversity of libraries of potential substituents or entire molecules
by using stochastic optimization of diversity functions and a point mutation Monte Carlo
technique. Peptide libraries have been designed by using topological descriptors and quantitative
structure-activity relationships combined with a genetic algorithm and simulated
annealing. Diverse libraries of synthetic biodegradable polymers have been designed by
using molecular topology descriptors and a genetic algorithm. Similarly, peptoid libraries
have been designed by using multivariate quantitative structure-activity relationships and
statistical experimental design.

The question of how an initial library should be redesigned for subsequent rounds of 
high-throughput experimentation in light of the results of the first round of screening remains 
unanswered. In this paper, we suggest that Monte Carlo methods provide a natural means 
for library redesign in high-throughput experimentation. The Monte Carlo method is a well-known
statistical method for sampling large spaces efficiently and ergodically. There is a
striking analogy between searching configuration space for regions of low free energy in a
Monte Carlo simulation and searching composition space for regions of high figure of merit in 
a combinatorial chemistry experiment. Importantly, Monte Carlo methods do not suffer the 
curse of dimensionality. A Monte Carlo approach should, therefore, be exponentially more 
efficient than a regular, grid-like method for libraries of complex molecules. Indeed, a Monte 
Carlo approach to materials discovery proves to be dramatically more efficient than does 
a grid-based approach. The application of Monte Carlo methods to small molecule
high-throughput experimentation differs from the conventional computer simulation technique
in several aspects. First, the variables in molecular high-throughput experimentation are 
discrete, and no continuous moves are available. Second, multiple simultaneous searches 
of the variable space are performed in high-throughput experimentation when screening 
the large libraries. Finally, temperature has a natural meaning in Monte Carlo computer 
simulation, whereas "temperature" is simply a control parameter in a Monte Carlo protocol 
for high-throughput experimentation. In principle, the temperature in the protocol sets
how strongly the protocol differentiates between compounds with low and high figures of
merit. In practice, temperatures that are too low may cause the method to become trapped 
in local optima unless a sufficiently powerful Monte Carlo scheme is used. 

Despite the many successes of high-throughput experimentation, the method has been 
criticized as a simple machinery, lacking incorporation of a priori knowledge when compared
with the traditional synthetic approach. A priori knowledge, such as chemical intuition,
previous database or experimental information, well-known theory, patentability, or
other specific constraints, is indispensable to an efficient library design and is the traditional
province of the synthetic chemist. Fascinatingly, the Monte Carlo approach to
high-throughput experimentation can naturally incorporate such knowledge in the experimental
design through the technique of biased Monte Carlo. 

Genetic algorithms are the computational analog of Darwinian evolution. Typically, a 
genetic algorithm consists of three basic processes: crossover, mutation, and selection. In 
the crossover step, new compounds are generated by mixing the compositions of parent com- 
pounds. In the mutation step, individual molecules are changed at random. In the selection 
step, the best molecules are identified for the next round. The application of genetic algorithms
to combinatorial synthesis and library design has achieved considerable success.
Nonetheless, unlike Monte Carlo algorithms, genetic algorithms do not satisfy detailed balance.
Because of this, genetic algorithms cannot be guaranteed to sample properly the
variable space or to locate optimal molecules. Furthermore, in most experiments, one wants
to identify several initially promising molecules in the hope that a few among them can survive
further stringent screenings, such as patentability or lack of side effects. In the genetic
approach, however, all the molecules in the library tend to become similar to each other due
to the crossover step. While diversity can be encouraged in a genetic approach, diversity
can never be guaranteed. The Monte Carlo approach, on the other hand, can maintain or 
even increase the diversity of a molecular library, due to the satisfaction of detailed balance. 

In this paper, we propose several strategies for small molecule high-throughput experi- 
mentation derived by analogy with Monte Carlo methods. We compare these Monte Carlo 
protocols to the genetic algorithm approach. In order to make this comparison and to demon- 
strate the effectiveness of the Monte Carlo approach, we perform simulated high-throughput 
experiments. In section 2, we introduce a random energy model that we use as a surrogate
for experimental measurement of the figure of merit. The random energy model is not
fundamental to the protocols; it is introduced as a simple way to test, parameterize, and 
validate the various searching methods. In an experimental implementation, the random 
energy model would be replaced by the value returned by the screen. In section 3, we intro- 
duce the Monte Carlo protocols and provide a means to calculate the diversity of a library. 
In section 4, we compare the various protocols. In section 5, we discuss some implications 
of these results. We conclude in section 6. 

2 Space of Variables and A Random Energy Model for 
Small Molecule High-Throughput Experimentation 

To quantitatively describe the molecules in a high-throughput library, we uniquely char- 
acterize each molecule by its composition, such as the identity of the core and substituents. 
For specificity, we will consider the figure of merit of interest to be a binding constant, but 
our results should be generically valid. A schematic view of our model is presented in Figure 1.
For simplicity, we consider the small molecule to consist of one core, drawn from a library
of cores, and six binding substituents, each drawn from a single library of substituents.
Numerous energetic interactions could exist between this molecule and the substrate. It is
commonly believed that descriptors can be directly related to compound performance. A
large class of descriptors, such as one-dimensional, two-dimensional, three-dimensional, and
BCUT descriptors, has been used to measure the diversity between substituents, cores, and
molecules in the literature. To simplify, we will limit ourselves to a set of six weakly
correlated descriptors for each substituent and core. For example, the descriptors could be
hydrogen bond donors, hydrogen bond acceptors, flexibility, an electro-topological calculation,
clogP, and aromatic density.

To carry out the simulated experiments, we need a figure of merit function that mimics 
the experimental step of measuring the figure of merit. Once constructed or "synthesized," 
the molecules are scored by the model, which takes the composition or molecular descriptors 
as input. A random energy model can mimic the generic features of an experimental figure of 
merit. For example, the NK model is used to model combinatorial chemistry experiments on 
peptides, the block NK and generalized NK models are used to model protein molecular
evolution experiments, and the random phase volume model is used to model materials
discovery.

The basic building block for our random energy model is a random polynomial of n 
descriptors, $x_1, \ldots, x_n$,

We choose to take the square root of $f$ here because we consider each term of the unsymmetrized
polynomial to be random. The scale factor $\xi$ is used to equalize roughly the
contributions from each term of the polynomial. Since $x_i$ will be drawn from a Gaussian
random distribution of zero mean and unit variance, we set

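$$\xi = \left( \frac{\langle x^q \rangle}{\langle x^2 \rangle} \right)^{1/(q-2)} = \left( \frac{q!}{2^{q/2}\,(q/2)!} \right)^{1/(q-2)} \qquad (3)$$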

We use q = 6 and n = 6 in our random energy model. 

The random energy model accounts for contributions to substrate binding arising from 
interactions between the substrate and core and from interactions between the substrate 
and each of the substituents. In addition, synergistic effects between the substituents and 
core are incorporated. Consider, for example, a molecule made from core number m from 
the core library and substituents number $s_1, \ldots, s_6$ from the substituent library. The core is
characterized by six descriptors, $D_1^{(m)}, \ldots, D_6^{(m)}$. Similarly, each substituent is characterized
by six descriptors, $d_1^{(s)}, \ldots, d_6^{(s)}$. We denote the core contribution to binding by $E_C$ and the
substituent contributions by $E_S$. We denote the contribution due to synergistic substituent-substituent
interactions by $E_{SS}$ and the contribution due to synergistic core-substituent
interactions by $E_{CS}$. The total contribution to the figure of merit is, then,

$$E = E_S + E_C + E_{SS} + E_{CS} \qquad (4)$$

Each of these factors is given in terms of the random polynomial: 

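$$E_S = \alpha_1 \sum_{i=1}^{6} F(d_1^{(s_i)}, \ldots, d_6^{(s_i)}, \{G_S\}) \qquad (5)$$

$$E_C = \alpha_2 F(D_1^{(m)}, \ldots, D_6^{(m)}, \{G_C\}) \qquad (6)$$

$$E_{SS} = \alpha_3 \sum_{i=1}^{6} h_i F(d_{j_1}^{(s_i)}, d_{j_2}^{(s_i)}, d_{j_3}^{(s_i)}, d_{j_4}^{(s_{i+1})}, d_{j_5}^{(s_{i+1})}, d_{j_6}^{(s_{i+1})}, \{G_{SS}\}) \qquad (7)$$

$$E_{CS} = \alpha_4 \sum_{i=1}^{6} h_i F(d_{k_1}^{(s_i)}, d_{k_2}^{(s_i)}, d_{k_3}^{(s_i)}, D_{k_4}^{(m)}, D_{k_5}^{(m)}, D_{k_6}^{(m)}, \{G_{CS}\}) \qquad (8)$$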



where $\{G_S\}$, $\{G_C\}$, $\{G_{SS}\}$, and $\{G_{CS}\}$ are four sets of fixed random Gaussian variables
with zero mean and unit variance. The $\alpha_i$ are constants to be adjusted so that the synergistic
terms will contribute in desired percentages, and $h_i$ is a structural constant indicating the
strength of the interaction at binding site $i$. The interaction strengths $h_i$ are chosen from
a Gaussian distribution of zero mean and unit variance for each site on each core. Only
synergistic interactions between neighboring substituents are considered in $E_{SS}$, and it is
understood that $s_7$ refers to $s_1$ in eq 7. In principle, the polynomial in eq 7 could be a
function of all 12 descriptors of both substituents. We assume, however, that important
contributions come from interactions among three randomly chosen distinct descriptors of
substituent $s_i$, $d_{j_1}^{(s_i)}$, $d_{j_2}^{(s_i)}$, and $d_{j_3}^{(s_i)}$, and another three randomly chosen distinct descriptors
of substituent $s_{i+1}$, $d_{j_4}^{(s_{i+1})}$, $d_{j_5}^{(s_{i+1})}$, and $d_{j_6}^{(s_{i+1})}$. Similarly, we assume that core-substituent
contributions come from interactions between three randomly chosen distinct descriptors of
the substituent, $d_{k_1}^{(s_i)}$, $d_{k_2}^{(s_i)}$, and $d_{k_3}^{(s_i)}$, and another three randomly chosen distinct descriptors
of the core, $D_{k_4}^{(m)}$, $D_{k_5}^{(m)}$, and $D_{k_6}^{(m)}$. Both $j_i$ and $k_i$ are descriptor indices ranging from 1 to 6.
Assuming that we have integrated out the degrees of freedom of the substrate, these indices 
depend only on the core. 

The parameters in the random energy model are chosen to mimic the complicated in- 
teractions between a small molecule and a substrate. We choose to focus on the case where 
these interactions are unpredictable, which is typical. That is, in a typical experiment, it 
would not be possible to predict the value of the screen in terms of molecular descriptors. 
Indeed, when rational design fails, an intelligent use of high-throughput experimentation is 
called for. The task of library design and redesign, rather than single molecule design, is the 
one we address in the next section. 

3 Monte Carlo Strategies and Diversity Measurement 

Before initiating the Monte Carlo protocol, we first build the core and substituent libraries.
We denote the size of the core library by $N_C$ and the size of the substituent library
by $N_S$. In a real experiment, the six descriptors would then be calculated for each core
and substituent. In the simulated experiment, the values of the six descriptors of each substituent
and core are extracted from a Gaussian random distribution with zero mean and
unit variance. In the simulated experiment, we also associate two sets of random interaction
descriptor indices to each core for the interaction terms in eqs 7 and 8.

To give a baseline for comparison, we first design the library using a random construc- 
tion. New molecules are constructed by random selection of one core and six substituents 
from the libraries. Since the properties of each substituent and core are assigned randomly, 
this first library should be reasonably diverse and comparable to examples in literature. 

For the Monte Carlo schemes, the initial molecular configurations are assigned randomly 
as before. The library is modified by the Monte Carlo protocol in subsequent rounds of
high-throughput experimentation. Two kinds of move are possible for each molecule in the library,
core changes and substituent changes. Either the core is changed with probability $p_{\mathrm{core}}$, or
one of the six substituents is picked randomly to change. We denote the probability of
changing from core $m$ to $m'$ by $T(m \to m')$ and from substituent $i$ to $i'$ by $t(i \to i')$. The
new configurations are updated according to the acceptance rule at $\beta$, the inverse of the
protocol temperature. All the samples are sequentially updated in one Monte Carlo round.

For the simple Metropolis method, the transition matrices are 

$$T(m \to m') = 1/N_C \qquad (9)$$

$$t(i \to i') = 1/N_S \qquad (10)$$

and the acceptance rule is

$$\mathrm{acc}(o \to n) = \min[1, \exp(-\beta \Delta E)] \qquad (11)$$
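
The Metropolis round defined by eqs 9-11 can be sketched in a few lines. This is only an illustrative sketch: the molecule representation (a core index plus a tuple of six substituent indices), the function names, the parameter values, and the toy energy function standing in for the screen are all assumptions made for the example, not part of the protocol itself.

```python
import math
import random

def metropolis_round(library, energy, n_cores, n_subs, p_core, beta, rng):
    """One Monte Carlo round: each molecule gets one trial move (eqs 9-11).

    A molecule is a tuple (core, (s1, ..., s6)); `energy` stands in for the screen.
    """
    updated = []
    for core, subs in library:
        if rng.random() < p_core:
            # Core move: new core drawn uniformly, T(m -> m') = 1/N_C.
            trial = (rng.randrange(n_cores), subs)
        else:
            # Substituent move: one random site, t(i -> i') = 1/N_S.
            site = rng.randrange(len(subs))
            new_subs = list(subs)
            new_subs[site] = rng.randrange(n_subs)
            trial = (core, tuple(new_subs))
        delta_e = energy(trial) - energy((core, subs))
        # Metropolis rule: always accept downhill, uphill with exp(-beta * dE).
        if delta_e <= 0 or rng.random() < math.exp(-beta * delta_e):
            updated.append(trial)
        else:
            updated.append((core, subs))
    return updated

# Toy stand-in for the screen (an assumption): lower index sum means lower energy.
def toy_energy(molecule):
    core, subs = molecule
    return 0.01 * (core + sum(subs))

rng = random.Random(0)
library = [(rng.randrange(15), tuple(rng.randrange(1000) for _ in range(6)))
           for _ in range(10)]
start = sum(toy_energy(m) for m in library) / len(library)
for _ in range(200):
    library = metropolis_round(library, toy_energy, 15, 1000, 0.1, 50.0, rng)
end = sum(toy_energy(m) for m in library) / len(library)
```

In an experimental setting, each call to `energy` would correspond to synthesizing and screening one compound, so the figure of merit of the current molecule would be stored rather than measured again.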



To make use of the idea that smaller moves are accepted more often, we could try to choose a
modified substituent or core that is similar to the current one; that is, we could use a transition
matrix weighted towards those substituents or cores close to the current one in the six-dimensional
descriptor space. Interestingly, this refinement turns out not to work any better
than does the simple random move. It seems that even a small move in the descriptor 
space is already much larger than the typical distance between peaks on the figure of merit 
landscape. 

Biased Monte Carlo methods have been shown to improve the sampling of complex 
molecular systems by many orders of magnitude. In contrast to conventional Metropolis
Monte Carlo, trial moves in biased schemes are no longer chosen completely at random. By 
generating trial configurations with a probability that depends on a priori knowledge, the 
moves are more likely to be favorable and more likely to be accepted. As we are dealing 
with a discrete configurational space, the implementation of biased Monte Carlo in this 
case is relatively simple. First, we need a biasing term for both substituents and cores. 
Since the form of this term is not unique, we can proceed in several different ways. One 
strategy is to bias our choice of core and substituent on the individual contributions of the 
cores and substituents to the figure of merit. We might know, or be able to estimate, these 
contributions from theory. For the random energy model, for example, we know 

$$e^{(i)} = \alpha_1 F(d_1^{(i)}, \ldots, d_6^{(i)}, \{G_S\}) \qquad (12)$$

$$E^{(m)} = \alpha_2 F(D_1^{(m)}, \ldots, D_6^{(m)}, \{G_C\}) \qquad (13)$$

where $e^{(i)}$ is the bias energy of substituent $i$ in the library, and $E^{(m)}$ the bias energy of
core $m$ in the library. Alternatively, we can estimate the contribution of each substituent or
core to the figure of merit experimentally. An electrospray ionization source coupled to a
mass spectrometer, for example, can serve this purpose. To measure the contributions, we
do a pre-experiment on 10000 randomly constructed molecules. This number of compounds
will give on average each substituent 60 hits and each core 667 hits. By averaging the figure
of merit of the molecules containing a particular substituent or core over the total number of
hits, we can obtain experimental estimates of $e^{(i)}$ and $E^{(m)}$. Using these two methods of bias,
we construct three different types of biased Monte Carlo schemes: theoretical biased move,
experimental biased move, and mixed biased move. In theoretical bias, both $e^{(i)}$ and $E^{(m)}$
are from the random energy model. In experimental bias, both $e^{(i)}$ and $E^{(m)}$ are calculated
from the pre-experiment. In mixed bias, $e^{(i)}$ comes from the random energy model, while
$E^{(m)}$ comes from the pre-experiment.
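
The pre-experiment averaging can be illustrated as follows. This is a sketch under stated assumptions: the function `estimate_bias`, the molecule representation, and the additive toy screen are hypothetical and serve only to exercise the estimator. The expected hit counts follow directly from the library sizes quoted in the text (six substituent sites, 1000 substituents, 15 cores, 10000 molecules).

```python
import random
from collections import defaultdict

def estimate_bias(molecules, scores):
    """Average the measured figure of merit over all molecules containing each fragment."""
    core_sum, core_hits = defaultdict(float), defaultdict(int)
    sub_sum, sub_hits = defaultdict(float), defaultdict(int)
    for (core, subs), score in zip(molecules, scores):
        core_sum[core] += score
        core_hits[core] += 1
        for s in subs:  # a substituent used at k sites counts as k hits
            sub_sum[s] += score
            sub_hits[s] += 1
    core_bias = {m: core_sum[m] / core_hits[m] for m in core_sum}
    sub_bias = {s: sub_sum[s] / sub_hits[s] for s in sub_sum}
    return core_bias, sub_bias, core_hits, sub_hits

N_C, N_S, N_PRE = 15, 1000, 10000
expected_sub_hits = 6 * N_PRE / N_S   # 60 hits per substituent on average
expected_core_hits = N_PRE / N_C      # about 667 hits per core on average

rng = random.Random(1)
molecules = [(rng.randrange(N_C), tuple(rng.randrange(N_S) for _ in range(6)))
             for _ in range(N_PRE)]
# Hypothetical additive screen used only to exercise the estimator.
scores = [0.1 * core + 0.001 * sum(subs) for core, subs in molecules]
core_bias, sub_bias, core_hits, sub_hits = estimate_bias(molecules, scores)
```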

These biases tend to exhibit a large gap between a few dominant cores and substituents 
and the rest. To ensure the participation of more substituents and cores in the strategy, we 
introduce cutoff energies for the substituent and core, $e_c$ and $E_c$. We arbitrarily choose $e_c$ to
be the 21st lowest substituent energy and $E_c$ to be the 4th lowest core energy. The biased
energy, $e_b^{(i)}$, for the $i$th substituent is

$$e_b^{(i)} = \begin{cases} e^{(i)} & \text{if } e^{(i)} > e_c \\ e_c & \text{otherwise} \end{cases} \qquad (14)$$

And the biased energy, $E_b^{(m)}$, for the $m$th core is

$$E_b^{(m)} = \begin{cases} E^{(m)} & \text{if } E^{(m)} > E_c \\ E_c & \text{otherwise} \end{cases} \qquad (15)$$
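
As a small illustration of the cutoff in eqs 14 and 15, the sketch below raises every bias energy that lies below the chosen rank up to the cutoff value, so that the dominant fragments share the same bias. The helper name `apply_cutoff` and the 30-member list of energies are hypothetical.

```python
def apply_cutoff(energies, rank):
    """Raise every energy below the rank-th lowest value up to that cutoff (eqs 14-15)."""
    cutoff = sorted(energies)[rank - 1]
    return [e if e > cutoff else cutoff for e in energies], cutoff

# Hypothetical bias energies for a 30-member substituent library.
raw = [-3.0, -2.5] + [0.1 * i for i in range(28)]
biased, e_c = apply_cutoff(raw, rank=21)
```

With `rank=21`, the 21 most favorable substituents end up with the identical biased energy `e_c`, while the remaining ones keep their original values.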
To correct for this bias, we introduce Rosenbluth factors. Since the transition probabilities
are the same at each Monte Carlo step, we have a constant Rosenbluth factor for the
substituent

$$w(n) = w(o) = \sum_{i=1}^{N_S} \exp(-\beta e_b^{(i)}) \qquad (16)$$

The probability of transition from substituent $i$ to $i'$ is

$$t(i \to i') = \frac{\exp(-\beta e_b^{(i')})}{w(n)} \qquad (17)$$

In the same way, the Rosenbluth factor for the core is

$$W(n) = W(o) = \sum_{m=1}^{N_C} \exp(-\beta E_b^{(m)}) \qquad (18)$$

The probability of transition from core $m$ to $m'$ is

$$T(m \to m') = \frac{\exp(-\beta E_b^{(m')})}{W(n)} \qquad (19)$$

Finally, we define the remaining, non-biased part of the figure of merit to be

$$E_b = E - E_b^{(m)} - \sum_{i=1}^{6} e_b^{(s_i)} \qquad (20)$$

To satisfy detailed balance, the acceptance rule becomes

$$\mathrm{acc}(o \to n) = \min[1, \exp(-\beta \Delta E_b)] \qquad (21)$$
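
A biased substituent move built from eqs 16, 17, 20, and 21 might look like the sketch below. The function names, the five-member substituent library, and the purely additive toy energy are assumptions for illustration. The core is held fixed in this example, so its biased energy cancels in the difference $\Delta E_b$ and is omitted.

```python
import math
import random

def rosenbluth_weights(biased_energies, beta):
    """Boltzmann weights exp(-beta * e_b) and their sum, the Rosenbluth factor (eq 16)."""
    weights = [math.exp(-beta * e) for e in biased_energies]
    return weights, sum(weights)

def biased_substituent_move(subs, e_b, total_energy, beta, rng):
    """Propose with t(i -> i') = exp(-beta e_b)/w, accept on the unbiased remainder (eq 21)."""
    weights, _ = rosenbluth_weights(e_b, beta)
    site = rng.randrange(len(subs))
    new_sub = rng.choices(range(len(e_b)), weights=weights)[0]
    trial = subs[:site] + (new_sub,) + subs[site + 1:]
    # E_b = E - sum of biased substituent energies; the core term cancels in the difference.
    old_rest = total_energy(subs) - sum(e_b[s] for s in subs)
    new_rest = total_energy(trial) - sum(e_b[s] for s in trial)
    delta = new_rest - old_rest
    if delta <= 0 or rng.random() < math.exp(-beta * delta):
        return trial
    return subs

rng = random.Random(2)
e_b = [0.0, 1.0, 2.0, 3.0, 4.0]                    # hypothetical biased energies
additive = lambda subs: sum(e_b[s] for s in subs)  # purely additive toy model
subs = (4, 4, 4, 4, 4, 4)
for _ in range(300):
    subs = biased_substituent_move(subs, e_b, additive, 2.0, rng)
weights, w = rosenbluth_weights(e_b, 2.0)
```

For the additive toy model the remainder $E_b$ is identically zero, so every proposal is accepted and the bias alone drives the sampling. Synergistic terms would enter only through the acceptance step.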

We add a swap move that attempts to exchange fragments between two molecules to 
the set of Monte Carlo moves. We denote the probability of attempting a swap instead
of a single-molecule move as $p_{\mathrm{swap}}$. In a swap move, the cores or a pair of substituents
may be swapped between two randomly selected molecules. The probability of switching
the core or substituent at the same position is given by $p_{\mathrm{swapC}}$ and $p_{\mathrm{swapS}}$, respectively.
The crossover event from genetic algorithms could also be introduced in the swap moves,
but this additional move did not improve the results. The acceptance rule for swapping is
$\mathrm{acc}(o \to n) = \min[1, \exp(-\beta \Delta E)]$.
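
A swap move can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: the substituent-swap probability is taken to be one minus the core-swap probability, and $\Delta E$ is taken as the change in the summed energy of the two molecules involved. The toy energy with a core-substituent product term is likewise hypothetical.

```python
import math
import random

def swap_move(mol_a, mol_b, energy, p_swap_core, beta, rng):
    """Try to exchange the cores, or the substituents at one shared position, of two molecules."""
    (core_a, subs_a), (core_b, subs_b) = mol_a, mol_b
    if rng.random() < p_swap_core:
        trial_a, trial_b = (core_b, subs_a), (core_a, subs_b)
    else:
        site = rng.randrange(len(subs_a))
        new_a, new_b = list(subs_a), list(subs_b)
        new_a[site], new_b[site] = subs_b[site], subs_a[site]
        trial_a, trial_b = (core_a, tuple(new_a)), (core_b, tuple(new_b))
    # Delta E is taken over the pair, since both molecules change in one move.
    delta = energy(trial_a) + energy(trial_b) - energy(mol_a) - energy(mol_b)
    if delta <= 0 or rng.random() < math.exp(-beta * delta):
        return trial_a, trial_b
    return mol_a, mol_b

def toy_energy(molecule):
    core, subs = molecule
    return 0.1 * core * subs[0]  # hypothetical core-substituent synergy

rng = random.Random(3)
a, b = (2, (5, 1, 1, 1, 1, 1)), (9, (0, 7, 7, 7, 7, 7))
new_a, new_b = swap_move(a, b, toy_energy, 0.5, 1.0, rng)
```

Whether or not the swap is accepted, the fragments present in the pair are conserved; only their assignment to molecules can change.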

Parallel tempering is known to be a powerful tool for searching rugged energy landscapes.
In parallel tempering, the samples are divided into $k$ groups. The first group of samples is
simulated at $\beta_1$, the second group is at $\beta_2$, and so on, with $\beta_1 < \beta_2 < \cdots < \beta_k$. At the end of
each round, samples in group $i < k$ are allowed to exchange configurations with samples in
group $i+1$ with probability $p_{\mathrm{ex}}$. The corresponding acceptance rule for a parallel tempering
exchange is

$$\mathrm{acc}(o \to n) = \min[1, \exp(\Delta\beta \, \Delta E)] \qquad (22)$$

where $\Delta\beta = \beta_i - \beta_{i+1}$ and $\Delta E$ is the difference in energy between the sample in group $i$ and
the sample in group $i+1$. It is important to notice that this exchange step is experimentally
cost-free. Nonetheless, this step can be dramatically effective at helping the protocol
escape from local minima. The number of groups, the number of samples in each group,
the values of $\beta_i$, and the exchange probability, $p_{\mathrm{ex}}$, are experimental parameters to be tuned.
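
The exchange step of eq 22 can be sketched as follows. The choice of a random partner in the neighboring group is an assumption of this sketch, as are the function name and the two-sample demonstration in which each sample is represented simply by its energy.

```python
import math
import random

def exchange(low_beta_group, high_beta_group, energy, beta_lo, beta_hi, p_ex, rng):
    """Attempt configuration exchanges between adjacent temperature groups (eq 22)."""
    d_beta = beta_lo - beta_hi  # beta_i - beta_{i+1}, negative since beta_i < beta_{i+1}
    for a in range(len(low_beta_group)):
        if rng.random() >= p_ex:
            continue
        b = rng.randrange(len(high_beta_group))
        d_e = energy(low_beta_group[a]) - energy(high_beta_group[b])
        arg = d_beta * d_e
        if arg >= 0 or rng.random() < math.exp(arg):
            low_beta_group[a], high_beta_group[b] = high_beta_group[b], low_beta_group[a]

identity = lambda e: e  # each "sample" is represented by its own energy
hot, cold = [0.0], [5.0]
exchange(hot, cold, identity, 1.0, 100.0, 1.0, random.Random(4))
```

Because the exchange only relabels which group a compound belongs to, no new synthesis or screening is needed, which is why the step carries no experimental cost. In the demonstration the low-energy sample always moves to the high-$\beta$ group.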

We compare these Monte Carlo protocols to a standard genetic algorithm
approach. In the genetic algorithm, as in the Monte Carlo strategy, we perform
multiple rounds of experimentation on a large set of compounds. The difference between the 
Monte Carlo and the genetic algorithm lies in how the library is redesigned, that is, how the 
compounds are modified in each round. In the genetic algorithm, first we randomly select 
two parents. Then we list the explicit composition of each molecule, i.e. core, substituent 
1, . . . , substituent 6. After aligning the sequences from the two parents, we make a random 
cut and exchange part of the sequences before the cut. We also allow for random changes, 
or mutation, in the cores or substituents of the offspring. Finally, since the population is
doubled by crossover, we select the better half of the molecules to survive this procedure 
and continue on to the next round. 
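
One generation of the genetic algorithm described above can be sketched as follows. The representation of a molecule as a seven-element tuple, the per-position mutation probability `p_mut`, and the toy energy are assumptions made for illustration only.

```python
import random

def ga_round(population, energy, n_cores, n_subs, p_mut, rng):
    """One generation: crossover doubles the population, mutation perturbs, selection halves it."""
    children = []
    while len(children) < len(population):
        # Each molecule is a 7-gene sequence: [core, s1, ..., s6].
        pa, pb = rng.sample(population, 2)
        cut = rng.randrange(1, 7)
        # Exchange the part of the sequences before the cut.
        children.append(pb[:cut] + pa[cut:])
        children.append(pa[:cut] + pb[cut:])
    children = children[:len(population)]
    mutated = []
    for child in children:
        genes = list(child)
        for pos in range(7):
            if rng.random() < p_mut:
                genes[pos] = rng.randrange(n_cores if pos == 0 else n_subs)
        mutated.append(tuple(genes))
    # Keep the better (lower-energy) half of parents plus offspring.
    return sorted(population + mutated, key=energy)[:len(population)]

def toy_energy(genes):
    return sum(genes)  # hypothetical figure of merit: smaller is better

rng = random.Random(5)
population = [tuple([rng.randrange(15)] + [rng.randrange(1000) for _ in range(6)])
              for _ in range(20)]
best_before = min(map(toy_energy, population))
for _ in range(30):
    population = ga_round(population, toy_energy, 15, 1000, 0.05, rng)
best_after = min(map(toy_energy, population))
```

Because the parents compete with the offspring in the selection step, the best figure of merit in the population can never get worse from one round to the next in this sketch.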

The diversity of the library as it passes through the rounds of high-throughput experimentation
is an important quantity. We calculate the diversity, $\mathcal{D}$, as the standard deviation
of the library in the 42-dimensional descriptor space.

$$\mathcal{D}^2 = \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{j=1}^{6} \left( D_j^{(m(i))} - \langle D_j \rangle \right)^2 + \sum_{j=1}^{6} \sum_{k=1}^{6} \left( d_j^{(s_k(i))} - \langle d_j^{(k)} \rangle \right)^2 \right] \qquad (23)$$

where $m(i)$ is the index of the core of molecule $i$, $s_k(i)$ is the index of substituent $k$ of molecule
$i$, and $j$ is the index for the descriptor. The average value in each descriptor dimension is
given by $\langle D_j \rangle = N^{-1} \sum_{i=1}^{N} D_j^{(m(i))}$ and $\langle d_j^{(k)} \rangle = N^{-1} \sum_{i=1}^{N} d_j^{(s_k(i))}$. The diversity of the library
will change as the library changes. A larger library will generally possess a higher absolute
diversity simply due to the increased number of compounds. This important, but trivial,
contribution to the diversity is scaled out by the factor of 1/N in eq 23.
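
A possible implementation of the diversity measure is sketched below. It reads eq 23 as a root-mean-square deviation from the library mean over all 42 descriptor dimensions (6 for the core and 6 for each of the 6 substituent positions), which is an interpretation adopted for this sketch. The function name and the randomly generated descriptor tables are hypothetical.

```python
import math
import random

def diversity(library, core_desc, sub_desc):
    """Standard deviation of a library in the 42-dimensional descriptor space (eq 23)."""
    n = len(library)
    # One 42-component vector per molecule: 6 core descriptors, then 6 per substituent site.
    vectors = []
    for core, subs in library:
        vec = list(core_desc[core])
        for s in subs:
            vec.extend(sub_desc[s])
        vectors.append(vec)
    means = [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]
    total = sum((v[d] - means[d]) ** 2 for v in vectors for d in range(len(means)))
    return math.sqrt(total / n)

rng = random.Random(6)
core_desc = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(15)]
sub_desc = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(1000)]
library = [(rng.randrange(15), tuple(rng.randrange(1000) for _ in range(6)))
           for _ in range(100)]
d_random = diversity(library, core_desc, sub_desc)
d_clones = diversity([library[0]] * 100, core_desc, sub_desc)
```

A library made of copies of a single molecule has essentially zero diversity, whereas a randomly constructed library does not, which is the contrast the measure is meant to capture.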

4 Results 

To gauge how the synergistic terms in the figure of merit affect the efficiency of the 
Monte Carlo protocols, we consider three models with increasingly important synergistic 
effects. We do this by adjusting the $\alpha_i$ in eqs 5-8 so that the absolute values of the terms
are on average in the ratio $E_S : E_C : E_{SS} : E_{CS}$ = 1 : 1 : 0.5 : 0.3 in model I, 1 : 1 : 1 : 0.6
in model II, and 1 : 1 : 2 : 1.2 in model III. Finally, we set $\alpha_1 = 0.01$ arbitrarily in model I.
To maintain the same statistical magnitude of total energy, we set $\alpha_1 = 0.00778$ in model II
and $\alpha_1 = 0.00538$ in model III.
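
The quoted values of $\alpha_1$ are consistent with holding fixed the sum of the average absolute magnitudes of the four terms. The short check below makes that arithmetic explicit; the variable names are illustrative, and the reading that the sum is the conserved quantity is an inference from the numbers rather than a statement in the text.

```python
ratios = {
    "I": (1, 1, 0.5, 0.3),
    "II": (1, 1, 1, 0.6),
    "III": (1, 1, 2, 1.2),
}
alpha1_model_I = 0.01
reference = alpha1_model_I * sum(ratios["I"])  # total magnitude to hold fixed
alpha1 = {name: reference / sum(r) for name, r in ratios.items()}
```

Rounded to five decimal places, this gives 0.00778 for model II and 0.00538 for model III.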

The size of the library is fixed at $N_C = 15$ and $N_S = 1000$. The compositional space of
this model has $15 \times 1000^6$ distinct molecules. Clearly, it is impossible to search exhaustively
even this modestly complex space. We fix the total number of molecules to be synthesized 
at 100000, that is, all protocols will have roughly the same experimental cost. Specifically, 
100000 molecules will be made in the random library design protocol, while in the case of the 
Monte Carlo or genetic protocols, the number of molecules times the number of simulation 
rounds is kept fixed at 100000. 
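
The size of the composition space and the fixed synthesis budget can be made concrete with a few lines of arithmetic. The variable names are illustrative, and the set of round counts shown is just a sample of ways to split the budget.

```python
N_C, N_S, N_SITES = 15, 1000, 6
space_size = N_C * N_S ** N_SITES          # 15 x 1000^6 distinct molecules
budget = 100_000
# Library size implied by each choice of the number of rounds at fixed budget.
designs = {rounds: budget // rounds for rounds in (1, 100, 1000, 10000)}
fraction_screened = budget / space_size
```

Even the full budget of 100000 compounds covers less than one part in $10^{14}$ of the composition space.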



To locate optimal parameters for the protocols, we perform a few short pre-experiments. 
We first fix the energy coefficients in the energy function and the descriptors of the substituent 
and core libraries. For simple Metropolis, we find that it is optimal to use 10 samples with 
10000 rounds, suggesting that the system is still far from equilibrium at the random initial 
configuration. With the biased Monte Carlo method, we find that 100 samples and 1000 
rounds is optimal. We focus on systems with 1000 or 100 rounds, since fewer rounds are 
typically preferred in experiments. It is more difficult to achieve effective sampling in the 
system with 100 rounds, and so we use this system when setting optimal parameter values. 
For parallel tempering, it was optimal to have the samples divided into three subsets, with 
30% of the population at $\beta_1$, 40% at $\beta_2$, and 30% at $\beta_3$. The optimal parameters are listed in
Table I for each model. Determination of these parameter values corresponds experimentally 
to gaining familiarity with the protocol on a new system. 

The various Monte Carlo schemes are compared with the random selection method 
and the genetic algorithm. Once the optimal parameters are chosen, the coefficients of the 
energy function and the descriptor values of the substituent and core libraries are generated 
differently in each simulated experiment. The simulation results for the three models are 
shown in Figures 2-4. Each data point in the figure is an average over 20 independent runs.
This averaging is intended to give representative performance of the protocols on various 
figures of merit of experimental interest. Since there is much randomness in the results, the 
standard deviation of the average is shown as well. 

5 Discussion 

Although the average absolute values of the figures of merit in the three models are 
adjusted to be equal, the stronger the interaction terms, the greater the figure of merit we 
can find. For instance, the biased schemes find values of $-E$ in the range 30-40 in model I,
but values in the range 40-50 in model II and values in the range 50-60 in model III. This 
suggests that the figure of merit landscape has changed in detail as the synergistic effects 
in the model are adjusted. It is clear that for all systems, the Metropolis methods perform 
better than does random selection. The system with 1000 molecules and 100 rounds is not 
well-equilibrated by the Metropolis schemes, and an experiment with 100 molecules and 1000 
rounds significantly improves the optimal compounds identified. However, by incorporating a priori
knowledge, the biased Monte Carlo schemes are able to equilibrate the experiment with
either 1000 or 100 rounds. Interestingly, the theoretical bias and experimental bias methods 
yield similar results. This strongly suggests that a minimal number of pre-experiments can 
be very useful, both for the understanding of the structure of the figure of merit landscape 
and for improving the performance in future rounds. 

The results produced with the composite moves including swap and parallel tempering 
are slightly improved relative to those from the plain Monte Carlo schemes. Typically, 
however, these composite moves significantly improve the sampling of a rough landscape. 
Indeed, swapping and crossover moves are very effective in protein molecular evolution, 
where the variable space is extremely large. Perhaps the variable space is not so large in
small molecule high-throughput experimentation that these composite moves are required. 
Alternatively, the random energy model may underestimate the ruggedness of the landscape. 
The landscape for RNA substituents, for example, is estimated to be extremely rough, and
composite moves may prove more important in this case. 

The genetic algorithm is relatively easy to use. It does not satisfy detailed balance, 
however, so there is no theoretical guarantee of the outcome. The optimal figures of merit 
identified are, nonetheless, comparable to those from the better Monte Carlo methods for all 
three models. However, due to the crossover and selection steps in the genetic algorithm, the 
molecules in the library tend to become similar to each other, which prevents this scheme 
from sampling the whole variable space. To help elucidate this point, diversity measurements 
for model I are shown in Figure 5. It is clear that the genetic algorithm has reduced the
diversity of the library by 60% relative to the biased schemes. Interestingly, the Monte 
Carlo simulations actually increase the diversity from the initial random configurations. The 
biased schemes tend to bring the system to equilibrium relatively quickly, and the diversity 
measurements are similar for the 100 and 1000 round experiments. For the Metropolis 
method, on the other hand, an experiment with 100 rounds is less diverse than an experiment 
with 1000 rounds. The genetic algorithm approach finds less favorable figure of merit values 
in the 100 compound 1000 round experiment, presumably due to a greater sensitivity to the 
$\sqrt{10}$ reduction in the absolute diversity relative to the already small absolute diversity in
the 1000 compound 100 round experiment. 

The greater the number of potentially favorable molecules in the library space, the 
greater the diversity of the experimental library will be for the Monte Carlo methods. The 
genetic algorithm, on the other hand, will tend to produce a library that contains many 
copies of a single favorable molecule. A key distinction, then, is that a Monte Carlo strategy 
will sample many compounds from the figure of merit landscape, whereas a genetic algorithm 
will tend to produce a single molecule with a favorable figure of merit value. How strongly the 
compounds with high figures of merit are favored in the Monte Carlo strategy is determined 
by the protocol temperature, since the probability of observing a compound with figure of 
merit $-E$ is proportional to $\exp(-\beta E)$. The sampling achieved by the Monte Carlo methods
is important not only because it assures that the composition space is thoroughly sampled, 
but also because it assures that the library of final hits will be as diverse as possible. 
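
The Boltzmann weighting described above can be checked numerically on a toy problem. The following is a minimal sketch, not the library-level protocol used in the simulations: it assumes a single walker, a hypothetical list of five energies, and a uniform symmetric proposal over compounds. The function names `metropolis_sample` and `boltzmann` are illustrative.

```python
import math
import random
from collections import Counter


def metropolis_sample(energies, beta, n_steps, seed=0):
    """Return visit frequencies of a single Metropolis walker.

    Assumption: proposals are drawn uniformly from all compounds, so the
    proposal is symmetric and the plain Metropolis rule applies.
    """
    rng = random.Random(seed)
    n = len(energies)
    state = rng.randrange(n)
    counts = Counter()
    for _ in range(n_steps):
        proposal = rng.randrange(n)
        delta = energies[proposal] - energies[state]
        # Accept downhill moves always, uphill moves with exp(-beta * delta).
        if delta <= 0 or rng.random() < math.exp(-beta * delta):
            state = proposal
        counts[state] += 1
    return [counts[i] / n_steps for i in range(n)]


def boltzmann(energies, beta):
    """Normalized weights proportional to exp(-beta * E)."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


if __name__ == "__main__":
    # Hypothetical energies; lower E means a better figure of merit (-E).
    toy_energies = [-3.0, -2.5, -1.0, 0.0, 1.0]
    for beta in (0.5, 2.0):
        observed = metropolis_sample(toy_energies, beta, n_steps=200_000)
        expected = boltzmann(toy_energies, beta)
        print(f"beta = {beta}")
        for e, o, x in zip(toy_energies, observed, expected):
            print(f"  E = {e:5.1f}  observed = {o:.3f}  Boltzmann = {x:.3f}")
```

Running the script at a small and a large value of beta shows the trade-off discussed here: a low beta (high temperature) spreads the visits over many compounds, whereas a high beta concentrates them on the few with the lowest energies.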

The Monte Carlo methods perform equally well on all three models. The three models 
were introduced to gauge the impact of unpredictable, synergistic effects in the experimental 
figure of merit. It might be expected that the a priori bias methods would perform less well 
as the synergistic effects become more pronounced. That the biased methods perform well 
even in model III suggests that the Monte Carlo approach may be rather robust. In other 
words, even a limited amount of a priori information is useful in the Monte Carlo approach 
to library redesign. 

6 Conclusion 

Monte Carlo appears to be a fruitful paradigm for experimental design of multi-round 
combinatorial chemistry, or high-throughput, experiments. A criticism of high-throughput 
experimentation has been its mechanical structure and lack of incorporation of a priori 
knowledge. As shown here, a biased Monte Carlo approach handily allows the incorporation 
of a priori knowledge. Indeed, our simulation results reveal that biased Monte Carlo 
schemes greatly improve the chances of locating optimal compounds. For the moderately 
complex libraries considered here, the bias can be determined equally well by experimental or 
theoretical means. Although the compounds identified from a traditional genetic algorithm 
are comparable to those from the better Monte Carlo schemes, the diversity of identified 
molecules is dramatically decreased in the genetic approach. Genetic algorithms, therefore, 
are less suitable when the list of good molecules is further winnowed by a secondary screen, a 
tertiary screen, patentability considerations, lack of side effects, or other concerns. Interestingly, 
composite Monte Carlo moves such as swap or parallel tempering bring only a slight 
improvement to the plain biased Monte Carlo protocols, possibly due to the relatively small 
size of the composition space in small molecule high-throughput experimentation. Presumably, 
as the complexity of the library is increased, these composite moves will prove more 
useful for the more challenging figures of merit. Although we have here chosen the initial 
library configurations at random, the sophisticated initial library design strategies available 
in the literature can be used, and they would complement the multi-round library redesign 
strategies presented here. 

Table 1: Optimal parameters used in simulations for the three random energy models. 

Figure 1: Schematic view of the small molecule model. 

Figure 2: Comparison of different Monte Carlo schemes with random and genetic schemes for 
energy model I ($E_S : E_C : E_{SS} : E_{CS} = 1 : 1 : 0.5 : 0.3$). Data from two cases are shown, one 
with 1000 molecules and 100 rounds (filled diamonds) and one with 100 molecules and 1000 
rounds (unfilled squares). Only comparison between relative energy values is meaningful, as 
the energy scale is arbitrary. 






Figure 3: Comparison of different Monte Carlo schemes with random and genetic schemes 
for energy model II ($E_S : E_C : E_{SS} : E_{CS} = 1 : 1 : 1 : 0.6$). Data from two cases are shown, 
one with 1000 molecules and 100 rounds (filled diamonds) and one with 100 molecules and 
1000 rounds (unfilled squares). 






Figure 4: Comparison of different Monte Carlo schemes with random and genetic schemes 
for energy model III ($E_S : E_C : E_{SS} : E_{CS} = 1 : 1 : 2 : 1.2$). Data from two cases are shown, 
one with 1000 molecules and 100 rounds (filled diamonds) and one with 100 molecules and 
1000 rounds (unfilled squares). 

Figure 5: Diversity measurement of the final configurations for model I. Data from two 
cases are shown, one with 1000 molecules and 100 rounds (filled diamonds) and one with 100 
molecules and 1000 rounds (unfilled squares). The error bars are negligible. The contribution 
to the absolute diversity that scales as the square root of the number of molecules per round 
has been scaled out in this figure, as in eq 23. 